# High-resolution visual understanding

Llava UHD V2 Vicuna 7B

LLaVA-UHD v2 is a multimodal large language model built around a hierarchical window transformer. It captures visual detail at multiple granularities through a high-resolution feature pyramid.

Multimodal Fusion

CLIP Convnext Large D 320.laion2B S29b B131k Ft

A CLIP model based on the ConvNeXt-Large architecture and trained on the LAION-2B dataset. It supports zero-shot image classification and image-text retrieval.
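To make "zero-shot image classification" concrete, here is a minimal sketch of the scoring step a CLIP-style model performs once it has produced embeddings. It does not load this model: the three-dimensional vectors, the label prompts, the `zero_shot_classify` helper, and the `logit_scale` value of 100 are all illustrative assumptions, standing in for the real image and text encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_emb, text_embs, labels, logit_scale=100.0):
    """Return a label -> probability mapping for one image embedding.

    image_emb and text_embs are placeholder embeddings (assumption): in a
    real pipeline they would come from the model's image and text encoders.
    logit_scale is a hypothetical temperature chosen for illustration.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)

    # Numerically stable softmax over the candidate labels.
    max_logit = max(logits)
    exps = [math.exp(v - max_logit) for v in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Toy embeddings (made up for this example, not real model outputs).
labels = ["a photo of a cat", "a photo of a dog"]
image_emb = [0.9, 0.1, 0.0]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

probs = zero_shot_classify(image_emb, text_embs, labels)
best = max(probs, key=probs.get)
print(best)
```

Image-text retrieval uses the same similarity scores in the other direction: rather than ranking candidate labels for one image, it ranks candidate images for one text query.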
